This data set comprises route information from 41 railway companies, covering 2944 stations across the UK. Nodes represent railway stations, and a link exists between two nodes if at least one service operates between them. The data was sourced from \(\href{https://github.com/andresantoro/Tranet_data/blob/master/README.md}{here}\) and contains a file reporting the stations, a file reporting the railway companies and 41 txt files reporting the routes operated over the stations by each company.
Hence, before importing the data, we merged the information related to the routes and the companies offering them in a single Excel file, obtaining the following data sets:
#nodes <- read_csv("/Users/biagiobuono/Documents/Statistical methods for network data/Trains_UK_dataset/nodes_list.csv")
#colnames(nodes) <- c("station")
#edges <- read_delim("/Users/biagiobuono/Documents/Statistical methods for network data/Trains_UK_dataset/Edges.csv",
#delim = ";", escape_double = FALSE, trim_ws = TRUE)
nodes <- read_csv("/Users/lollocino/Library/Mobile Documents/com~apple~CloudDocs/UNICATT/STATISTICAL METHODS FOR NETWORK DATA/Project/nodes_list.csv")
colnames(nodes) <- c("station")
edges <- read_delim("/Users/lollocino/Library/Mobile Documents/com~apple~CloudDocs/UNICATT/STATISTICAL METHODS FOR NETWORK DATA/Project/Edges.csv",
delim = ";", escape_double = FALSE, trim_ws = TRUE)
In the following chunk, we obtain the geographical coordinates of the network stations and visualize them on a Google Maps map by registering the Google Maps API key.
register_google(key = "AIzaSyAy8v2u9a9VX-Bsm7cNarnW1TAXGiJNiRs")
get_coordinates <- function(station_name) {
result <- geocode(station_name, output = "latlona", source = "google")
return(result)
}
coordinates <- nodes %>%
mutate(geocode_result = map(station, get_coordinates)) %>%
unnest_wider(geocode_result) %>%
select(station, lon, lat)
edges <- edges %>%
left_join(coordinates, by = c("Source" = "station")) %>%
rename(lat_start = lat, lon_start = lon) %>%
left_join(coordinates, by = c("Target" = "station")) %>%
rename(lat_end = lat, lon_end = lon)
uk_map <- get_map(location = 'UK', zoom = 5)
ggmap(uk_map) +
geom_point(data = coordinates, aes(x = lon, y = lat), color = "blue", size = 1) +
geom_segment(data = edges, aes(x = lon_start, y = lat_start, xend = lon_end, yend = lat_end), color = "red", size = 0.3)
ggmap(uk_map) +
geom_point(data = coordinates, aes(x = lon, y = lat), color = "blue", size = 1)
The map illustrates the extensive network of English rail traffic, encompassing all stations across the UK as well as European stations connected to the UK by train.
nodes is a one-column data frame reporting all the railway stations across the UK:
kable(head(nodes))
| station |
|---|
| 3Bdgs Down Thameslink Sdgs |
| Abbey Foregate C.S. |
| Abercynon |
| Aberdare |
| Aberdeen |
| Aberdeen Clayhills Car.M.D |
while edges consists of three columns representing the departure and the arrival stations of the route, and the company operating it:
kable(head(edges))
| Source | Target | Company | lon_start | lat_start | lon_end | lat_end |
|---|---|---|---|---|---|---|
| Birmingham International | Chester | Alliance Rail | -1.890008 | 52.48227 | -2.893075 | 53.19339 |
| Chester | Wolverhampton | Arriva Trains Wales | -2.893075 | 53.19339 | -2.125659 | 52.58682 |
| Cardiff Central Bus Station | Cardiff International APT | Arriva Trains Wales | -3.180344 | 51.47703 | -3.339677 | 51.39852 |
| Cardiff Central | Canton C.S.D. | Arriva Trains Wales | -3.179017 | 51.47614 | -75.205461 | 44.59715 |
| Birmingham New Street | Aberystwyth | Arriva Trains Wales | -1.898537 | 52.47914 | -4.082920 | 52.41530 |
| Milford Haven | Cardiff Central | Arriva Trains Wales | -5.042697 | 51.71431 | -3.179017 | 51.47614 |
Therefore, we are dealing with a directed network data set, meaning that the routes have a direction to their incident stations and they are represented by ordered pairs \((u, v)\), where \((u, v) \ne (v, u)\).
Prior to diving in the analysis of the network, we check for any missing values.
table(is.na(edges))
##
## FALSE TRUE
## 55802 7170
table(is.na(nodes))
##
## FALSE
## 2944
For data cleaning purposes, we check for the presence of any duplicated routes in the edges data frame, by creating a new object that accounts for the number of companies operating on each route.
multiple_edges <- edges %>%
group_by(Source, Target) %>%
summarize(n = n()) %>%
group_by(Source, Target) %>%
filter(n > 1)
kable(head(multiple_edges))
| Source | Target | n |
|---|---|---|
| Aberdeen | Dundee | 2 |
| Aberdeen Clayhills Car.M.D | Aberdeen | 2 |
| Barnham | Brighton | 2 |
| Bath Spa | St. Philips Mrsh H S T D | 2 |
| Beverley | Hull | 2 |
| Birmingham New Street | Birmingham International | 2 |
Finally, we merge the information gained from the previous step with the edges data frame in order to include the number of companies operating on each route.
merged_edges <- edges[,c(1,2)] %>%
left_join(multiple_edges, by = c("Source", "Target")) %>%
mutate(n = ifelse(is.na(n), 1, n)) %>%
distinct(Source, Target, .keep_all = TRUE)
kable(head(merged_edges))
| Source | Target | n |
|---|---|---|
| Birmingham International | Chester | 1 |
| Chester | Wolverhampton | 1 |
| Cardiff Central Bus Station | Cardiff International APT | 1 |
| Cardiff Central | Canton C.S.D. | 2 |
| Birmingham New Street | Aberystwyth | 1 |
| Milford Haven | Cardiff Central | 1 |
In order to get a better understanding of the network, we can visualize it through graphical representations known as graphs, that are pairs \(G = (V, E)\) consisting of sets of vertices and edges.
Moreover, since in the previous analysis we have added some information relative to the number of companies operating on each route, we can use it to set the weights of the edges.
set.seed(348)
# graph with duplicate interactions
g <- graph_from_data_frame(edges, directed = TRUE)
# graph without duplicate interactions, but weighted
g_w <- graph_from_data_frame(merged_edges, directed = TRUE)
# Set the weights of the edges based on the 'n' column
E(g_w)$weight <- merged_edges$n
The following table shows that almost all the edges have been associated with weights equal to 1, meaning that the routes are operated by only a railway company.
kable(table(E(g_w)$weight), col.names = c("Weight", "Frequency"))
| Weight | Frequency |
|---|---|
| 1 | 8569 |
| 2 | 188 |
| 3 | 14 |
| 4 | 1 |
| 5 | 1 |
Two routes only are operated by many companies:
kable(merged_edges[which(merged_edges$n == 5), ])
| Source | Target | n |
|---|---|---|
| Edinburgh | Craigentinny T.& R.S.M.D | 5 |
and
kable(merged_edges[which(merged_edges$n == 4), ])
| Source | Target | n |
|---|---|---|
| Darlington | York | 4 |
suggesting that the stations might be important ones.
V(g_w)$size <- 3 * sqrt(strength(g_w))
V(g_w)$size2 <- V(g_w)$size * .5
E(g_w)$width <- E(g_w)$weight
set.seed(348)
plot(g, vertex.label.cex = 0.5, main = "Unweighted graph")
set.seed(348)
plot(g_w, vertex.label.cex = 0.5, main = "Weighted graph", vertex.label = NA)
Both graphs display some vertices isolated from the others. These represent stations connected by a single route, with one station being the departure point and the other the destination.
To get an idea of the graph size, we can compute its diameter, that represents the longest distance, i.e. the longest shortest path:
diameter(g_w)
## [1] 18
and these are the stations that we pass through by moving along the diameter:
get_diameter(g_w)
## + 19/2941 vertices, named, from e3b41cd:
## [1] Kilwinning Up Siding Lochwinnoch
## [3] Carstairs C.E. Beattock
## [5] Slateford Carriage Sidings Grantshouse C.E.
## [7] Carlisle High Wapping SDGS Lancaster
## [9] Leeds London Kings Cross
## [11] Cambridge North Stratford
## [13] Willesden Jn Low Level Kilburn High Road
## [15] Stonebridge Park Stonebridge Park Depot
## [17] Elephant & Castle L.T. Harrow & Wealdstone DC
## [19] Queens Park A.C.
Given our network graph representation, we try to understand some properties of the system by studying network structure.
One of the primary ways to characterize a vertex is by its degree, which represents the number of edges connected to it. By examining the degree of each vertex, we can derive the degree distribution of the network, that is perhaps the most basic description at the vertex level.
The histogram below reveals that more than half of the vertices have a degree of 3 or less, while only a few vertices exhibit a significantly higher degree. This would lead us to consider the degree distribution to be heterogeneous, since it is clearly characterized by a heavy-tail.
hist(degree(g_w), breaks = 30,
xlab= "Degree",main = "Histogram of degree", col = "dodgerblue4")
abline(v = mean(degree(g_w)), col = 'red')
We can observe an almost linear pattern in the degree behavior. For the lower degree values, there is minimal noise because most observations fall within this range. However, for higher degree values, there is much more variability due to the significantly fewer observations. These type of distributions could be understood by the preferential-attachment model, that yields them tending to a power law.
degree_dist <- function(graph) {
fd <- table(degree(graph)) # number of nodes associated with each degree value
d <- as.numeric(names(fd)) + 1 # degree values
list(d = d, fd = fd)
}
dd <- degree_dist(g_w)
with(dd, plot(log(d), log(fd), main="Degree Distribution", xlab="Log Degree", ylab="Log Frequency of each Degree", pch=16, col="dodgerblue4"))
It is also interesting to simulate a series of subgraphs from the full graph to observe how the average degree varies and to gain insight into its variability. The actual graph average degree is equal to 5.97.
mean(degree(g_w))
## [1] 5.965998
We consider two different sampling schemes: in each we sample n vertices V* \(\subset\) V using random sampling, and then either
set.seed(348)
ng <- vcount(g_w) # number of nodes
n <- floor(ng / 5) # 588
v_star <- sample.int(ng, n)
g_star1 <- subgraph.edges(g_w, E(g_w)[.inc(v_star)], delete.vertices = FALSE)
g_star2 <- induced_subgraph(g_w, v_star)
Under each sampling design, we estimate the average degree with
\[ \eta (G^*) = \frac{1}{n} \sum_{i \in V^*} d_i^* \]
mean(degree(g_star1)[v_star])
## [1] 5.877551
mean(degree(g_star2))
## [1] 1.295918
Now we repeat this procedure for 400 times and represent the results:
set.seed(348)
nmc <- 400
est_mc <- map_df(1:nmc, function (mc) {
v_star <- sample(ng, n)
data.frame(mc = mc, method = c("snowball", "induced"),
estimate = c(mean(degree(g_w)[v_star]), mean(degree(induced_subgraph(g_w, v_star)))))
})
ggplot(est_mc, aes(x = estimate, fill = method)) +
geom_histogram(bins = 100) + geom_vline(xintercept = mean(degree(g_w)))
We can easily observe that Design 2 (induced) significantly underestimates the average degree, whereas Design 1 (snowball) provides a more accurate estimate. In fact, the true value falls in the middle of the snowball distribution, which closely resembles a normal distribution with limited variability.
The transitivity or clustering coefficient is a measure of the degree to which vertices in a graph tend to cluster together. It’s defined as the ratio of the number of triangles to the number of triplets (connected triples of vertices) in the network. High transitivity means that the network contains communities or groups of nodes that are densely connected internally.
transitivity(g_w)
## [1] 0.1723685
Our network exhibits a rather low clustering coefficient, indicating that the nodes are not as strongly connected as one might expect based on the graph’s visual representation.
Strength represents the sum of the edge weights of the adjacent edges for each vertex.
s <- strength(g_w)
hist(s, breaks = 30, main= "Histogram of Strength",
xlab= "Strength", col = "dodgerblue4")
For instance, the station with the highest strength is Endiburgh, that reaches a value equal to 115. It is actually the station with the highest degree and it is involved in the routes that are operated by several companies.
s[which.max(s)]
## Edinburgh
## 115
Betweenness centrality measures how central or influential a vertex is based on the concept of shortest paths. In essence, vertices with higher betweenness centrality are considered more central within the network, as they typically lie on a greater number of shortest paths between pairs of other vertices.
b <- betweenness(g_w)
hist(b, breaks = 100, main= "Histogram of Betweenness",
xlab= "Betweenness", col=hcl.colors(length(b), rev = F, palette= 'blues'), border="black")
Analyzing the plot of betweenness versus strength helps in understanding the importance and role of each node in the network. In general, we observe a fairly linear relationship between the two measures. However, for strength values below 30, nodes tend to exhibit relatively low betweenness.
Moreover, we can identify two nodes that stand out with respect to the others and they represent the stations of Edinburgh and Wembley Eur Frt Ops Cntre.
plot(s, b, ylab="Betweenness", xlab= "Strength", main= "Betweenness vs Strength", pch=16, col=hcl.colors(length(b), rev = F, palette= 'blues'))
Hence, we could expect these two stations to be the most important stations in the network.
ia <- order(b, decreasing=T)[c(1,2)]
name <- names(b[ia])
cat('The two most important actors in the network are', toupper(name[1]), 'and', toupper(name[2]))
## The two most important actors in the network are WEMBLEY EUR FRT OPS CNTRE and EDINBURGH
To conduct a comprehensive analysis of the network, we will examine the behavior of the degree distribution using three models: the Linear Model, the Generalized Linear Model (Poisson), and the Exponential Random Graph Model (ERGM).
mod0 <- lm(log(fd) ~ log(d), data=dd)
cat('model with the fd transformed:', '\n')
## model with the fd transformed:
summary(mod0)
##
## Call:
## lm(formula = log(fd) ~ log(d), data = dd)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.1255 -0.2257 0.0167 0.2149 1.1240
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.51392 0.20619 41.29 <2e-16 ***
## log(d) -1.97674 0.05872 -33.67 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4257 on 68 degrees of freedom
## Multiple R-squared: 0.9434, Adjusted R-squared: 0.9426
## F-statistic: 1133 on 1 and 68 DF, p-value: < 2.2e-16
Based on the results of the linear model, we find that the estimated intercept value is approximately equal to 8.5: this value represents the expected log-transformed frequency when the log-transformed degree is zero. Additionally, the estimated coefficient for the log-degree is -1.98, indicating the expected change in the log-transformed frequency for a one-unit increase in the log-transformed degree.
Moreover, we observe a Multiple R-squared value of 0.94, meaning that 94% of the variance in the response variable is explained by the model. Furthermore, the extremely low p-value associated with the overall significance of the model provides strong evidence against the null hypothesis, suggesting that the relationship is significant.
mod1 <- glm(fd ~ log(d), family = poisson, data=dd)
cat('model without the fd transformed:', '\n')
## model without the fd transformed:
summary(mod1)
##
## Call:
## glm(formula = fd ~ log(d), family = poisson, data = dd)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 8.17672 0.03656 223.7 <2e-16 ***
## log(d) -1.80861 0.02110 -85.7 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 10275.347 on 69 degrees of freedom
## Residual deviance: 79.737 on 68 degrees of freedom
## AIC: 342.33
##
## Number of Fisher Scoring iterations: 4
GLM results show that the estimated intercept is equal to 8.18, while the estimated coefficient for log(d) is -1.81, that represents the average change in the logarithm of the expected frequency for a one-unit increase in the logarithm of degrees. Notably, both coefficients exhibit extremely small p-values (<2e-16), indicating a highly significant relationship between the predictor variable and the response variable.
The deviance values are noteworthy: the null deviance, representing the deviance when only the intercept is included in the model, is 10275, while the residual deviance, representing the deviance after fitting the model, is 79.78. This substantial reduction in deviance suggests that the model explains a significant portion of the variability in the data.
with(dd, plot(log(d),log(fd),main="Degree Distribution", xlab="Degree", ylab="Frequency of each Degree", pch=16, col=hcl.colors(length(b), rev = F, palette= 'blues')))
abline(a=mod0$coef[1],b=mod0$coef[2], col='red', lwd=2)
abline(a=mod1$coef[1],b=mod1$coef[2], col='blue', lwd=2)
legend("topright", legend = c("Linear Model", "Poisson"), col = c("red", "blue"), lwd = 2)
Looking at the plot, both models seem to perform quite well, especially for smaller values of the logarithm of the degree, whereas for higher values the models are less able to properly capture the data structure, since there is much more variability.
The Exponential Random Graph Model (ERGM) analyzes the patterns and the processes within network data, using maximum pseudolikelihood estimation (MPLE), and estimating the impact of the edge predictor on the network structure. Pseudolikelihood is a popular approach for ERGMs especially when the likelihood function becomes computationally intensive or analytically intractable due to the complex dependencies between network edges. It involves calculating not the likelihood of the entire network, but the one of each edge \((i, j)\) being present or absent conditioned on the rest of the network.
Hence, we are dealing with a model where the response variable, represented by the likelihood of the observed network, is a function of the number of edges.
In this case, the estimated coefficient for the edge predictor is -6.89: this negative coefficient suggests that the formation of edges is less likely, indicating a sparse network structure, which is a type of graph where the number of edges is much lower compared to the possible maximum number of edges (\(n \cdot (n-1)\)). Moreover, sparse networks often contain isolated nodes (with no connections) or small disconnected components (subsets of nodes that are only connected within the subset), which is actually our case.
am <- as_adjacency_matrix(g_w, sparse = FALSE)
g_ergm <- as.network(am, directed = TRUE)
ergm(g_ergm ~ edges) %>% summary
## Starting maximum pseudolikelihood estimation (MPLE):
## Obtaining the responsible dyads.
## Evaluating the predictor and response matrix.
## Maximizing the pseudolikelihood.
## Finished MPLE.
## Evaluating log-likelihood at the estimate.
## Call:
## ergm(formula = g_ergm ~ edges)
##
## Maximum Likelihood Results:
##
## Estimate Std. Error MCMC % z value Pr(>|z|)
## edges -6.89222 0.01068 0 -645.3 <1e-04 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Null Deviance: 11986650 on 8646540 degrees of freedom
## Residual Deviance: 138486 on 8646539 degrees of freedom
##
## AIC: 138488 BIC: 138502 (Smaller is better. MC Std. Err. = 0)
This paragraph is dedicated to the clustering analysis, that is useful to reveal potential underlying structures in the networks. Particularly, we employ the Edge Betweenness algorithm, since it is the most suitable for directed graphs as the one we are considering.
g_ceb <- graph_from_data_frame(merged_edges, directed=TRUE)
ceb <- cluster_edge_betweenness(g_ceb)
l_ceb <- length(ceb)
cat('The number of clusters by Edge Betweenness Algorithm are:',l_ceb,'\n')
## The number of clusters by Edge Betweenness Algorithm are: 1583
mod_ceb <- modularity(ceb)
cat('The Modularity Coefficient is:',mod_ceb,'\n')
## The Modularity Coefficient is: 0.1598863
1,583 clusters are identified in the network, a pretty high number considering that we have 2944 nodes in the graph.
Modularity measures the strength of the division of a network into clusters (or communities): a higher modularity score (closer to 1) indicates a strong community structure, where nodes are more densely connected within clusters than between them. In this case, the modularity score of 0.159 suggests that the community structure is not very strong, indicating that the network does not have well-defined clusters.
It is interesting to examine the distribution of cluster memberships to understand how many nodes are in each cluster. The following plot provides a graphical representation of it. The y-axis represents the number of nodes for each cluster, and the x-axis represents the clusters. We can easily observe a significant concentration of nodes in the first clusters, while the others seem to contain just a few nodes.
hist(membership(ceb), xlab= "Clusters", main= "Cluster Memebership distribution", col=hcl.colors(length(b), rev = F, palette= 'blues'), border="black")
Actually, when we sort the clusters by size in ascending order, we find that the most densely populated cluster is the \(14^{th}\), which appears in the first break of the histogram.
tail(sort(table(membership(ceb))))
##
## 222 98 124 278 152 14
## 14 15 20 21 29 884
set.seed(348)
V(g_ceb)$community <- ceb$membership
plot(ceb, g_ceb, vertex.size = 5, vertex.label = NA, asp = .5, main = "Edge Bewteennes Clustering")
This plot illustrates the network’s structure with nodes and edges color-coded to represent different clusters identified by the Edge Betweenness Algorithm. The plot shows a dense and intricate network where nodes are highly interconnected; there is significant overlap between different clusters, as evidenced by the multiple colors densely packed together, indicating weak separability. The presence of many small clusters can be inferred, supporting the idea that the network consists of numerous small communities rather than a few large, well-defined ones.
Since the \(14^{th}\) cluster contains nearly a third of all the network’s stations, we will now focus on this cluster. By creating a subgraph to represent it, we can analyze its properties and behavior in greater detail.
gs <- induced_subgraph(g_w, which(ceb$membership == 14))
plot(gs, vertex.label.cex = 0.5)
hist(degree(gs), breaks = 20,
xlab= "Degree", main = "Histogram of degree (Subgraph)", col = "dodgerblue4")
abline(v = mean(degree(gs)), col = 'red')
transitivity(gs)
## [1] 0.09566887
ss <- strength(gs)
hist(ss, breaks = 30, main= "Histogram of Strength (Subgraph)",
xlab= "Strength", col = "dodgerblue4")
bs <- betweenness(gs)
hist(bs, breaks = 40, main= "Histogram of Betweenness (Subgraph)",
xlab= "Betweenness", col=hcl.colors(length(b), rev = F, palette= 'blues'), border="black")
plot(ss, bs, ylab="Betweenness", xlab= "Strength", main= "Betweenness vs Strength (Subgraph)", pch=16, col=hcl.colors(length(b), rev = F, palette= 'blues'))
ia <- order(bs, decreasing=T)[c(1,2,3,4)]
name <- names(b[ia])
cat('The two most important actors in the network are', toupper(name[1]), ',', toupper(name[2]), ',', toupper(name[3]), 'and', toupper(name[4]))
## The two most important actors in the network are EBBW VALE TOWN , CLEETHORPES , KETTERING and AMERSHAM
degree_dist <- function (g) {
fd <- table(degree(g))
d <- as.numeric(names(fd)) + 1 # degree + 1
list(d = d, fd = as.numeric(fd))
}
dd <- degree_dist(gs)
l <- lm(log(dd$fd) ~ log(dd$d), data = dd)
m <- glm(dd$fd ~ log(dd$d), data = dd, family = poisson)
plot(log(dd$d), log(dd$fd))
#abline(l, col = "red", lwd = 2)
#abline(m, col = "blue", lwd = 2)
#legend("topright", legend = c("Linear Model", "Poisson"), col = c("red", "blue"), lwd = 2)
The subgraph appears to share similar properties with the full graph, with most nodes exhibiting small degrees, low strength, and low betweenness values. However, the most important stations within the subgraph differ from those identified in the complete network.